In [1]:
from rasterstats import raster_stats
elev = '/data/projects/murdock/zonal_fcid_test/dem_aea2_feet.tif'
polys = '/data/projects/murdock/zonal_fcid_test/fcid.shp'
stats = raster_stats(polys, elev, stats="*")
len(stats)


Out[1]:
20763

We can now answer interesting questions such as "Which polygon has the highest standard deviation in elevation?" by sorting the list on the std key.


In [2]:
sorted(stats, key=lambda k: k['std'], reverse=True)[0]


Out[2]:
{'__fid__': 17038,
 'count': 1087,
 'majority': 1523.3047,
 'max': 2710.309814453125,
 'mean': 2199.236430542778,
 'median': 2233.90380859375,
 'min': 1457.9298095703125,
 'std': 339.93001065442024,
 'sum': 2390570.0}
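
Each of these dicts carries a __fid__ key identifying the source polygon, so the statistics can be related back to the feature attributes in the shapefile. As a minimal sketch of one way to do that with Fiona (assuming __fid__ lines up with the Fiona record id for this shapefile; the FCID attribute name is hypothetical):

import fiona
from rasterstats import raster_stats

elev = '/data/projects/murdock/zonal_fcid_test/dem_aea2_feet.tif'
polys = '/data/projects/murdock/zonal_fcid_test/fcid.shp'
stats = raster_stats(polys, elev, stats="*")

# Index the zonal results by feature id for quick lookup
stats_by_fid = dict((s['__fid__'], s) for s in stats)

with fiona.open(polys) as src:
    for feature in src:
        fid = int(feature['id'])  # Fiona stores the feature id as a string
        result = stats_by_fid.get(fid)
        if result is None:
            continue  # polygon had no overlapping raster cells
        # 'FCID' is a hypothetical attribute name; substitute a real field
        print("%s  mean=%.1f  std=%.1f" % (
            feature['properties'].get('FCID'), result['mean'], result['std']))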

Let's see how this performs with a 4.8 MB raster and a 3.3 MB shapefile containing more than 20,000 polygons.


In [3]:
%timeit raster_stats(polys, elev)


1 loops, best of 3: 15.1 s per loop

Compare that to an alternative:

  • QGIS Zonal Statistics Plugin (only does count, sum and mean): 1 min 51 sec

That's roughly 7x faster, and rasterstats was running inside a VM in this case!

Let's try to optimize the raster_stats call by using global_src_extent=True. This loads the raster into memory once for the entire extent of the vector layer instead of reading a window per feature; fewer disk reads can mean better performance, provided raster access from disk is the limiting factor and the raster plus the resulting temporary arrays fit into memory.


In [4]:
%timeit raster_stats(polys, elev, global_src_extent=True)


1 loops, best of 3: 14.9 s per loop

No improvement at all, which indicates that this raster data source is already quick to read from disk. For slower sources (such as JPEG-compressed rasters or networked data sources) where reading the pixel values is the bottleneck, global_src_extent can improve performance.
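
If you want to check whether global_src_extent pays off for your own data outside of IPython, the standard library's timeit module gives a comparable measurement (a minimal sketch; the paths are the same as above, and number=1 keeps the run time reasonable):

import timeit

setup = (
    "from rasterstats import raster_stats; "
    "elev = '/data/projects/murdock/zonal_fcid_test/dem_aea2_feet.tif'; "
    "polys = '/data/projects/murdock/zonal_fcid_test/fcid.shp'")

# Time one run of each strategy: per-feature raster reads vs. a single
# read of the raster over the full extent of the vector layer.
per_feature = timeit.timeit("raster_stats(polys, elev)",
                            setup=setup, number=1)
global_extent = timeit.timeit("raster_stats(polys, elev, global_src_extent=True)",
                              setup=setup, number=1)

print("per-feature reads:    %.1f s" % per_feature)
print("global source extent: %.1f s" % global_extent)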